Multimodal sentiment analysis aims to discern a speaker's emotions from diverse signals such as audio, text, and images. However, current research in this field relies heavily on large supervised datasets, which demand substantial labeling effort and computational resources. This study introduces a meta-learning-based approach for multimodal sentiment analysis that extracts emotional information from movie scenes. By leveraging meta-learning, the approach captures emotions in movie scenes from only a small number of annotated samples, enabling the model to generalize under scarce labeled data. Specifically, we introduce an optimization-based meta-learning method for multimodal sentiment analysis tasks spanning text and vision, improving the model's ability to generalize to new tasks with limited annotations. In addition, we propose a novel strategy based on Meta-Prompting for handling multimodal data. The merits of the proposed method are validated on multiple datasets.
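For concreteness, the sketch below illustrates one plausible reading of the optimization-based component as a first-order MAML-style inner/outer loop over few-shot tasks. It is a minimal sketch under stated assumptions, not the authors' implementation: the late-fusion linear head, the learnable prompt vector standing in for the Meta-Prompting strategy, the feature dimensions, and the random task sampler are all hypothetical placeholders.

```python
# Hedged sketch: first-order MAML-style meta-learning over few-shot
# multimodal tasks. All names, dimensions, and the task sampler are
# illustrative assumptions, not the paper's actual design.
import torch
import torch.nn.functional as F

TEXT_DIM, VIS_DIM, PROMPT_DIM, N_CLASSES = 768, 512, 64, 3  # assumed sizes

def init_params():
    """Learnable prompt vector plus a linear head over [prompt | text | vision]."""
    prompt = torch.zeros(PROMPT_DIM)  # hypothetical stand-in for Meta-Prompting
    w = torch.randn(PROMPT_DIM + TEXT_DIM + VIS_DIM, N_CLASSES) * 0.01
    b = torch.zeros(N_CLASSES)
    for p in (prompt, w, b):
        p.requires_grad_()
    return [prompt, w, b]

def forward(params, text_feat, vis_feat):
    prompt, w, b = params
    n = text_feat.shape[0]
    x = torch.cat([prompt.expand(n, -1), text_feat, vis_feat], dim=-1)
    return x @ w + b

def inner_adapt(params, support, inner_lr=1e-2, steps=3):
    """Task-specific adaptation on the few labeled support examples."""
    fast = [p.clone() for p in params]
    for _ in range(steps):
        loss = F.cross_entropy(forward(fast, support[0], support[1]), support[2])
        grads = torch.autograd.grad(loss, fast)  # first-order: no create_graph
        fast = [p - inner_lr * g for p, g in zip(fast, grads)]
    return fast

def meta_step(params, tasks, meta_opt):
    """Outer update: query loss of the adapted weights, backprop to the init."""
    meta_opt.zero_grad()
    meta_loss = 0.0
    for support, query in tasks:
        fast = inner_adapt(params, support)
        meta_loss = meta_loss + F.cross_entropy(
            forward(fast, query[0], query[1]), query[2])
    (meta_loss / len(tasks)).backward()
    meta_opt.step()

def random_task(k_shot=4, n_query=8):
    """Hypothetical sampler standing in for annotated movie-scene tasks."""
    def batch(n):
        return (torch.randn(n, TEXT_DIM), torch.randn(n, VIS_DIM),
                torch.randint(0, N_CLASSES, (n,)))
    return batch(k_shot), batch(n_query)

params = init_params()
meta_opt = torch.optim.SGD(params, lr=1e-3)
for _ in range(100):  # meta-training over batches of few-shot tasks
    meta_step(params, [random_task() for _ in range(4)], meta_opt)
```

The design choice worth noting is the split between the inner loop, which adapts a copy of the parameters to each task from a handful of support samples, and the outer loop, which updates the shared initialization so that such adaptation succeeds on held-out query samples; this is the mechanism by which optimization-based meta-learning yields generalization from limited annotations.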